Abstract

We present a unified framework for online learning of structured classifiers that handles a wide
family of convex loss functions, properly including CRFs, structured SVMs, and the structured
perceptron. We introduce a new aggressive online algorithm that optimizes any loss in this family.
For the structured hinge loss, this algorithm reduces to 1-best MIRA; in general, it can be regarded
as a dual coordinate ascent algorithm. The approximate inference scenario is also addressed. Our
experiments on two NLP problems show that the algorithm converges to accurate models at least
as fast as stochastic gradient descent, without the need to specify any learning rate parameter.


1  Introduction 


Learning structured classifiers discriminatively typically involves the minimization of a regularized
loss function; the well-known cases of conditional random fields (CRFs, [Lafferty et al., 2001])
and structured support vector machines (SVMs, [Taskar et al., 2003, Tsochantaridis et al., 2004,
Altun et al., 2003]) correspond to different choices of loss functions. For large-scale settings, the
underlying optimization problem is often difficult to tackle in its batch form, increasing the
popularity of online algorithms. Examples are the structured perceptron [Collins, 2002a], stochastic
gradient descent (SGD) [LeCun et al., 1998], and the margin infused relaxed algorithm (MIRA)
[Crammer et al., 2006].

This  paper  presents  a  unified  representation  for  several  convex  loss  functions  of  interest  in 
structured  classification  (§2).  In  §3,  we  describe  how  all  these  losses  can  be  expressed  in  variational 
form  as  optimization  problems  over  the  marginal  polytope  [Wainwright  and  Jordan,  2008].  We 
make  use  of  convex  duality  to  derive  new  online  learning  algorithms  (§4)  that  share  the  “passive- 
aggressive”  property  of  MIRA  but  can  be  applied  to  a  wider  variety  of  loss  functions,  including 
the  logistic  loss  that  underlies  CRFs.  We  show  that  these  algorithms  implicitly  perform  coordinate 
ascent  in  the  dual,  generalizing  the  framework  in  Shalev-Shwartz  and  Singer  [2006]  for  a  larger 
set  of  loss  functions  and  for  structured  outputs. 

The updates we derive in §4 share the remarkable simplicity of SGD, with an important
advantage: they do not require tuning a learning rate parameter or specifying an annealing schedule.
Instead,  the  step  sizes  are  a  function  of  the  loss  and  its  gradient.  The  additional  computation 
required  for  loss  evaluations  is  negligible  since  the  methods  used  to  compute  the  gradient  also 
provide  the  loss  value. 

Two important problems in NLP provide an experimental testbed (§5): named entity recognition
and dependency parsing. We employ feature-rich models where exact inference is sometimes
intractable. To be as general as possible, we devise a framework that fits any structured
classification problem representable as a factor graph with soft and hard constraints (§2); this includes
problems with loopy graphs, such as some variants of the dependency parsers of Smith and Eisner
[2008].

2  Structured  Classification  and  Loss  Functions 

2.1  Inference  and  Learning 

Denote by $\mathcal{X}$ a set of input objects from which we want to infer some hidden structure
conveyed in an output set $\mathcal{Y}$. We assume a supervised setting, where we are given labeled data
$\mathcal{D} = \{(x_1, y_1), \ldots, (x_m, y_m)\} \subseteq \mathcal{X} \times \mathcal{Y}$. Each input $x \in \mathcal{X}$ (e.g., a sentence) is associated with
a set of legal outputs $\mathcal{Y}(x) \subseteq \mathcal{Y}$ (e.g., candidate parse trees); we are interested in the case where
$\mathcal{Y}(x)$ is a structured set whose cardinality grows exponentially with the size of $x$. We consider
linear classifiers $h_\theta : \mathcal{X} \to \mathcal{Y}$ of the form

$$h_\theta(x) = \arg\max_{y \in \mathcal{Y}(x)} \theta^\top \phi(x, y), \qquad (1)$$

where $\theta \in \mathbb{R}^d$ is a vector of parameters and $\phi(x, y) \in \mathbb{R}^d$ is a feature vector. Our goal is to
learn the parameters $\theta$ from the data $\mathcal{D}$ such that $h_\theta$ has small generalization error. We assume a
cost function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is given, where $\ell(\hat{y}, y)$ is the cost of predicting $\hat{y}$ when the true
output is $y$. Typically, direct minimization of the empirical risk, $\min_{\theta \in \mathbb{R}^d} \frac{1}{m} \sum_{i=1}^{m} \ell(h_\theta(x_i), y_i)$, is
intractable and hence a surrogate non-negative, convex loss $L(\theta; x, y)$ is used. To avoid overfitting,
a regularizer $R(\theta)$ is added, yielding the learning problem

$$\min_{\theta \in \mathbb{R}^d} \lambda R(\theta) + \frac{1}{m} \sum_{i=1}^{m} L(\theta; x_i, y_i), \qquad (2)$$

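where $\lambda \in \mathbb{R}$ is the regularization coefficient. Throughout this paper we assume $\ell_2$-regularization,
$R(\theta) = \frac{1}{2}\|\theta\|^2$, and focus on loss functions of the form

$$L_{\beta,\gamma}(\theta; x, y) = \frac{1}{\beta} \log \sum_{y' \in \mathcal{Y}(x)} \exp\Big[\beta\Big(\theta^\top\big(\phi(x, y') - \phi(x, y)\big) + \gamma\,\ell(y', y)\Big)\Big], \qquad (3)$$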


which subsumes some well-known cases:

• The logistic loss (in CRFs), $L_{\mathrm{CRF}}(\theta; x, y) = -\log P_\theta(y|x)$, corresponds to $\beta = 1$ and $\gamma = 0$;

• The hinge loss of structured SVMs, $L_{\mathrm{SVM}}(\theta; x, y) = \max_{y' \in \mathcal{Y}(x)} \theta^\top(\phi(x, y') - \phi(x, y)) + \ell(y', y)$,
corresponds to the limit case $\beta \to \infty$ and any $\gamma > 0$;

• The loss underlying the structured perceptron is obtained for $\beta \to \infty$ and $\gamma = 0$;

• The softmax-margin loss recently proposed in Gimpel and Smith [2010] is obtained with $\beta = \gamma = 1$.

For any choice of $\beta > 0$ and $\gamma \ge 0$, the resulting loss function is convex in $\theta$, since, up to a scale
factor, it is the composition of the (convex) log-sum-exp function with an affine map.¹ In §4 we
present a dual coordinate ascent online algorithm to handle (2), for this family of losses.
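As an illustration of how the family in (3) interpolates between the listed special cases, the following minimal sketch evaluates $L_{\beta,\gamma}$ by explicit enumeration of a small output set. The function name `loss` and the list-based inputs (`scores`, `costs`, `gold`) are assumptions made for this example only; enumeration is not feasible for real structured outputs, where the variational machinery of §3 is needed.

```python
import math

def loss(scores, costs, gold, beta, gamma):
    """L_{beta,gamma} for one instance by explicit enumeration.

    scores[k] = theta^T phi(x, y_k), costs[k] = cost of predicting y_k,
    gold = index of the true output. beta = math.inf gives the limit case.
    """
    # theta^T (phi(x, y') - phi(x, y)) + gamma * cost(y', y) for every candidate y'
    terms = [s - scores[gold] + gamma * c for s, c in zip(scores, costs)]
    if math.isinf(beta):
        return max(terms)  # hinge (gamma > 0) or perceptron (gamma = 0) loss
    top = max(terms)  # shift for a numerically stable log-sum-exp
    return top + math.log(sum(math.exp(beta * (t - top)) for t in terms)) / beta

scores = [1.0, 2.0, 0.5]   # hypothetical model scores of three candidate outputs
costs = [0.0, 1.0, 1.0]    # cost of each candidate with respect to the gold output 0
crf = loss(scores, costs, 0, beta=1.0, gamma=0.0)            # logistic loss
svm = loss(scores, costs, 0, beta=math.inf, gamma=1.0)       # structured hinge loss
perceptron = loss(scores, costs, 0, beta=math.inf, gamma=0.0)
softmax_margin = loss(scores, costs, 0, beta=1.0, gamma=1.0)
```

With a large finite $\beta$ the value approaches the hinge loss, which is one way to see the limit case stated above.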


2.2  A  Framework  for  Structured  Inference 

Two important inference problems are: to obtain the most probable assignment (i.e., to solve (1))
and to compute marginals, when a distribution is defined on $\mathcal{Y}(x)$. Both problems can be
challenging when the output set is structured. Typically, there is a natural representation of the elements of
$\mathcal{Y}(x)$ as discrete-valued vectors $y \equiv (y_1, \ldots, y_I) \in \mathcal{Y}_1 \times \ldots \times \mathcal{Y}_I \equiv \bar{\mathcal{Y}}$, each $\mathcal{Y}_i$ being a set
of labels ($I$ may depend on $x$). We consider subsets $S \subseteq \{1, \ldots, I\}$ and write partial assignment
vectors as $y_S = (y_i)_{i \in S}$. We assume a one-to-one map (not necessarily onto) from $\mathcal{Y}(x)$ to $\bar{\mathcal{Y}}$ and
denote by $\mathcal{S}(x) \subseteq \bar{\mathcal{Y}}$ the subset of representations that correspond to valid outputs.

The next step is to design how the feature vector $\phi(x, y)$ decomposes, which can be
conveniently done via a factor graph [Kschischang et al., 2001, McCallum et al., 2009]. This is a

¹Some important non-convex losses can also be written as differences of losses in this family. By defining
$\delta L_{\beta,\gamma} = L_{\beta,\gamma} - L_{\beta,0}$, the case $\beta = 1$ yields $\delta L_{\beta,\gamma}(\theta; x, y) = \log E_\theta \exp \ell(Y, y)$, which is an upper bound on $E_\theta \ell(Y, y)$, used
in minimum risk training [Smith and Eisner, 2006]. For $\beta = \infty$, $\delta L_{\beta,\gamma}$ becomes a structured ramp loss [Collobert
et al., 2006].

root John hit the ball with the bat


Figure  1:  Example  of  a  dependency  parse  tree  (adapted  from  [McDonald  et  al.,  2005]). 

bipartite graph with two types of nodes: variable nodes, which in our case are the $I$ components
of $y$; and a set $\mathcal{C}$ of factor nodes. Each factor node is associated with a subset $C \subseteq \{1, \ldots, I\}$;
an edge connects the $i$th variable node and a factor node $C$ iff $i \in C$. Each factor has a potential
$\Psi_C$, a function that maps assignments of variables to non-negative real values. We distinguish
between two kinds of factors: hard constraint factors, which are used to rule out forbidden
partial assignments by mapping them to zero potential values, and soft factors, whose potentials are
strictly positive. Thus, $\mathcal{C} = \mathcal{C}_{\mathrm{hard}} \cup \mathcal{C}_{\mathrm{soft}}$. We associate with each soft factor a local feature vector
$\phi_C(x, y_C)$ and define

$$\phi(x, y) = \sum_{C \in \mathcal{C}_{\mathrm{soft}}} \phi_C(x, y_C). \qquad (4)$$

The potential of a soft factor is defined as $\Psi_C(x, y_C) = \exp(\theta^\top \phi_C(x, y_C))$. In a log-linear
probabilistic model, the feature decomposition in (4) induces the following factorization for the
conditional distribution of $Y$:

$$P_\theta(Y = y \,|\, X = x) = \frac{1}{Z(\theta, x)} \prod_{C \in \mathcal{C}} \Psi_C(x, y_C), \qquad (5)$$

where $Z(\theta, x) = \sum_{y' \in \mathcal{S}(x)} \prod_{C \in \mathcal{C}} \Psi_C(x, y'_C)$ is the partition function. Two examples follow.

Sequence labeling: Each $i \in \{1, \ldots, I\}$ is a position in the sequence and $\mathcal{Y}_i$ is the set of possible
labels at that position. If all label sequences are allowed, then no hard constraint factors exist. In a
bigram model, the soft factors are of the form $C = \{i, i + 1\}$. To obtain a $k$-gram model, redefine
each $\mathcal{Y}_i$ to be the set of all contiguous $(k - 1)$-tuples of labels.

Dependency parsing: In this parsing formalism [Kübler et al., 2009], each input is a sentence
(i.e., a sequence of words), and the outputs to be predicted are the dependency arcs, which link
heads to modifiers, and overall must define a spanning tree (see Fig. 1 for an example). We let
each $i = (h, m)$ index a pair of words, and define $\mathcal{Y}_i = \{0, 1\}$, where 1 means that there is a link
from $h$ to $m$, and 0 means otherwise. There is one hard factor connected to all variables (call it
TREE), its potential being one if the arc configurations form a spanning tree and zero otherwise.
In the arc-factored model [Eisner, 1996, McDonald et al., 2005], all soft factors are unary and the
graph is a tree. More sophisticated models (e.g., with siblings and grandparents) include pairwise
factors, creating loops [Smith and Eisner, 2008].

3  Variational  Inference


3.1  Polytopes  and  Duality

Let $\mathcal{P} = \{P_\theta(.|x) \mid \theta \in \mathbb{R}^d\}$ be the family of all distributions of the form (5), and rewrite (4) as:

$$\phi(x, y) = \sum_{C \in \mathcal{C}_{\mathrm{soft}}} \phi_C(x, y_C) = F(x) \cdot \chi(y),$$

where $F(x)$ is a $d$-by-$k$ feature matrix, with $k = \sum_{C \in \mathcal{C}_{\mathrm{soft}}} \prod_{i \in C} |\mathcal{Y}_i|$, each column containing the
vectors $\phi_C(x, y_C)$ for each factor $C$ and configuration $y_C$; and $\chi(y)$ is a binary $k$-vector indicating
which configurations are active given $Y = y$. We then define the marginal polytope

$$Z(x) = \mathrm{conv}\{z \in \mathbb{R}^k \mid \exists y \in \mathcal{Y}(x) \text{ s.t. } z = \chi(y)\},$$

where conv denotes the convex hull. Note that $Z(x)$ only depends on the graph and on the
specification of the hard constraints (i.e., it is independent of the parameters $\theta$).² The next proposition
(illustrated in Fig. 2) goes farther by linking the points of $Z(x)$ to the distributions in $\mathcal{P}$.
Below, $H(P_\theta(.|x)) = -\sum_{y \in \mathcal{Y}(x)} P_\theta(y|x) \log P_\theta(y|x)$ denotes the entropy, $E_\theta$ the expectation under
$P_\theta(.|x)$, and $z_C(y_C)$ the component of $z \in Z(x)$ indexed by the configuration $y_C$ of factor $C$.

Proposition 1 There is a map coupling each distribution $P_\theta(.|x) \in \mathcal{P}$ to a unique $z \in Z(x)$ such
that $E_\theta[\chi(Y)] = z$. Define $H(z) = H(P_\theta(.|x))$ if some $P_\theta(.|x)$ is coupled to $z$, and $H(z) = -\infty$
if no such $P_\theta(.|x)$ exists. Then:

1. The following variational representation for the log-partition function holds:

$$\log Z(\theta, x) = \max_{z \in Z(x)} \theta^\top F(x) z + H(z). \qquad (6)$$

2. The problem in (6) is convex and its solution is attained at the factor marginals, i.e., there is a
maximizer $\bar{z}$ s.t. $\bar{z}_C(y_C) = \Pr_\theta\{Y_C = y_C\}$ for each $C \in \mathcal{C}$. The gradient of the log-partition
function is $\nabla \log Z(\theta, x) = F(x)\bar{z}$.

3. The MAP $\hat{y} = \arg\max_{y \in \mathcal{Y}(x)} P_\theta(y|x)$ can be obtained by solving the linear program

$$\hat{z} = \chi(\hat{y}) = \arg\max_{z \in Z(x)} \theta^\top F(x) z. \qquad (7)$$


Proof: [Wainwright and Jordan, 2008, Theorem 3.4] provide a proof for the canonical
overcomplete representation where $F(x)$ is the identity matrix, i.e., each feature is an indicator of the
configuration of the factor. In that case, the map from the parameter space to the relative
interior of the marginal polytope is surjective. In our model, arbitrary features are allowed and the
parameters are tied, since they are shared by all factors. This can be expressed as a linear map

²The marginal polytope can also be defined as the set of factor marginals realizable by distributions that factor
according to the graph. Log-linear models with canonical overcomplete parametrization (i.e., whose sufficient statistics
(features) at each factor are configuration indicators) are studied in Wainwright and Jordan [2008].

Figure 2: Dual parametrization of the distributions in $\mathcal{P}$. The original parameter is linearly mapped
to the factor log-potentials, the canonical overcomplete parameter space [Wainwright and Jordan,
2008], which is mapped onto the relative interior of the marginal polytope $Z(x)$. In general only
a subset of $Z(x)$ is reachable from our parameter space.


$\theta \mapsto s = F(x)^\top \theta$ that "places" our parameters $\theta \in \mathbb{R}^d$ onto a linear subspace of the canonical
overcomplete parameter space; therefore, our map $\theta \mapsto z$ is not necessarily onto the relative interior of $Z(x)$, unlike in
Wainwright and Jordan [2008], and our $H(z)$ is defined slightly differently: it can take the value
$-\infty$ if no $\theta$ maps to $z$. This does not affect the expression in (6), since the solution of this
optimization problem with our $H(z)$ replaced by theirs is also the feature expectation under $P_\theta(.|x)$
and the associated $z$, by definition, always yields a finite $H(z)$. ∎

3.2  Loss  Evaluation  and  Differentiation 

We now invoke Prop. 1 to derive a variational expression for evaluating any loss $L_{\beta,\gamma}(\theta; x, y)$ in
(3), and compute its gradient as a by-product.³ This is crucial for the learning algorithms to be
introduced in §4. Our only assumption is that the cost function $\ell(y', y)$ can be written as a sum
over factor-local costs; letting $z = \chi(y)$ and $z' = \chi(y')$, this implies $\ell(y', y) = p^\top z' + q$ for some
$p$ and $q$ which are constant with respect to $z'$.⁴ Under this assumption, and letting $s = F(x)^\top \theta$ be
the vector of factor log-potentials, $L_{\beta,\gamma}(\theta; x, y)$ becomes expressible in terms of the log-partition
function of a distribution whose log-potentials are set to $\beta(s + \gamma p)$. From (6), we obtain

$$L_{\beta,\gamma}(\theta; x, y) = \max_{z' \in Z(x)} \theta^\top F(x)(z' - z) + \frac{1}{\beta} H(z') + \gamma(p^\top z' + q). \qquad (8)$$

Let $\bar{z}$ be a maximizer in (8); from the second statement of Prop. 1 we obtain the following
expression for the gradient of $L_{\beta,\gamma}$ at $\theta$:

$$\nabla L_{\beta,\gamma}(\theta; x, y) = F(x)(\bar{z} - z). \qquad (9)$$
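The gradient formula in (9) can be checked numerically on a toy problem. The sketch below is a minimal illustration that assumes the output set is small enough to enumerate: the product $F(x)\bar{z}$ is then the expected feature vector under the distribution proportional to $\exp(\beta(\theta^\top\phi(x, y') + \gamma\,\ell(y', y)))$, and $F(x)z$ is the gold feature vector. The name `loss_and_grad` and the list-based inputs are hypothetical, and only finite $\beta$ is handled.

```python
import math

def loss_and_grad(theta, feats, costs, gold, beta=1.0, gamma=1.0):
    """Return L_{beta,gamma} and its gradient for an explicitly enumerated output set.

    feats[k] is phi(x, y_k) and costs[k] is the cost of y_k (costs[gold] must be 0).
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    # beta-scaled, cost-augmented scores of every candidate output
    a = [beta * (dot(theta, f) + gamma * c) for f, c in zip(feats, costs)]
    top = max(a)
    w = [math.exp(v - top) for v in a]
    total = sum(w)
    q = [v / total for v in w]  # distribution whose feature mean plays the role of F(x) z-bar
    value = (top + math.log(total)) / beta - dot(theta, feats[gold])
    expected = [sum(q[k] * feats[k][j] for k in range(len(feats))) for j in range(len(theta))]
    grad = [e - g for e, g in zip(expected, feats[gold])]
    return value, grad

theta = [0.3, -0.2]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
costs = [0.0, 1.0, 2.0]
value, grad = loss_and_grad(theta, feats, costs, gold=0, beta=2.0, gamma=0.5)
```

The same softmax pass yields both the loss value and the gradient, which is the point made in the text about the loss evaluation being a by-product.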

For  concreteness,  we  revisit  the  examples  discussed  in  the  previous  subsection. 

³Our description also applies to the (non-differentiable) hinge loss case, when $\beta \to \infty$, if we replace all instances
of "the gradient" in the text by "a subgradient."

⁴For the Hamming loss, this holds with $p = 1 - 2z$ and $q = 1^\top z$. See Taskar et al. [2006] for other examples.

Sequence Labeling. Without hard constraints, the graphical model does not contain loops, and
therefore $L_{\beta,\gamma}(\theta; x, y)$ and $\nabla L_{\beta,\gamma}(\theta; x, y)$ may be easily computed by setting the log-potentials as
described above and running the forward-backward algorithm.
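As a hedged illustration of the sequence labeling case, the sketch below computes the log-partition function of a bigram chain with the forward recursion and compares it against brute-force enumeration. It assumes position-independent bigram log-potentials for brevity; the function names are hypothetical. The backward pass, which would give the marginals needed for the gradient, is omitted.

```python
import math
from itertools import product

def logsumexp(vals):
    top = max(vals)
    return top + math.log(sum(math.exp(v - top) for v in vals))

def chain_log_partition(unary, pairwise):
    """log Z of a chain model via the forward recursion.

    unary[i][a]    : log-potential of label a at position i
    pairwise[a][b] : log-potential of the bigram (a, b), shared across positions
    """
    alpha = list(unary[0])
    for i in range(1, len(unary)):
        alpha = [unary[i][b] + logsumexp([alpha[a] + pairwise[a][b] for a in range(len(alpha))])
                 for b in range(len(unary[i]))]
    return logsumexp(alpha)

def brute_force_log_partition(unary, pairwise):
    """Same quantity by enumerating every label sequence (exponential cost)."""
    n, k = len(unary), len(unary[0])
    scores = []
    for y in product(range(k), repeat=n):
        s = sum(unary[i][y[i]] for i in range(n))
        s += sum(pairwise[y[i]][y[i + 1]] for i in range(n - 1))
        scores.append(s)
    return logsumexp(scores)

unary = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.0]]
pairwise = [[0.2, -0.1], [0.0, 0.3]]
fast = chain_log_partition(unary, pairwise)
slow = brute_force_log_partition(unary, pairwise)
```

To evaluate $L_{\beta,\gamma}$ one would first set the log-potentials to $\beta(s + \gamma p)$ as described above, run the recursion, and divide the result by $\beta$.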

Dependency Parsing. For the arc-factored model, $L_{\beta,\gamma}(\theta; x, y)$ and $\nabla L_{\beta,\gamma}(\theta; x, y)$ may be
computed exactly by modifying the log-potentials, invoking the matrix-tree theorem to compute the
log-partition function and the marginals [Smith and Smith, 2007, Koo et al., 2007, McDonald and
Satta, 2007], and using the fact that $H(\bar{z}) = \log Z(\theta, x) - \theta^\top F(x)\bar{z}$. The marginal polytope is
the same as the arborescence polytope in Martins et al. [2009]. For richer models where arc
interactions are considered, exact inference is intractable. Both the marginal polytope and the entropy,
necessary in (6), lack concise closed form expressions. Two approximate approaches have been
recently proposed: a loopy belief propagation (BP) algorithm for computing pseudo-marginals
[Smith and Eisner, 2008]; and an LP-relaxation method for approximating the most likely parse
tree [Martins et al., 2009]. Although the two methods may look unrelated at first sight, both
optimize over outer bounds of the marginal polytope. See [Martins et al., 2010] for further discussion.
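For the arc-factored case, the matrix-tree computation of the partition function can be sketched as follows. This is a minimal illustration, not the implementation used in the cited work: it assumes node 0 is an artificial root that may have several children, takes exponentiated arc scores `w[h][m]` as input, and uses a small hand-written determinant to stay dependency-free. All function names are hypothetical.

```python
import math
from itertools import product

def det(mat):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in mat]
    n, sign, result = len(a), 1.0, 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0.0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            sign = -sign
        result *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return sign * result

def partition_matrix_tree(w):
    """Sum over spanning trees rooted at node 0 of the product of arc weights w[h][m]."""
    n = len(w) - 1
    lap = [[0.0] * n for _ in range(n)]
    for m in range(1, n + 1):
        for h in range(0, n + 1):
            if h == m:
                continue
            lap[m - 1][m - 1] += w[h][m]      # total incoming weight of word m
            if h >= 1:
                lap[h - 1][m - 1] -= w[h][m]  # off-diagonal entry for arc h -> m
    return det(lap)

def partition_brute_force(w):
    """Same sum by enumerating every head assignment and keeping the valid trees."""
    n = len(w) - 1
    total = 0.0
    for heads in product(range(n + 1), repeat=n):  # heads[m-1] is the head of word m
        if any(heads[m - 1] == m for m in range(1, n + 1)):
            continue
        ok = True
        for m in range(1, n + 1):  # every word must reach the root without cycling
            seen, node = set(), m
            while node != 0 and node not in seen:
                seen.add(node)
                node = heads[node - 1]
            ok = ok and node == 0
        if ok:
            total += math.prod(w[heads[m - 1]][m] for m in range(1, n + 1))
    return total

w = [[0.0, 1.5, 0.7, 2.0],
     [0.0, 0.0, 1.2, 0.4],
     [0.0, 0.9, 0.0, 1.1],
     [0.0, 0.3, 2.5, 0.0]]
```

The determinant costs cubic time in the sentence length, whereas the enumeration grows exponentially; the latter is included only to check the former on a three-word example.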

4  Online  Learning 

We now propose a dual coordinate ascent approach to learn the model parameters $\theta$. This approach
extends the primal-dual view of online algorithms put forth by Shalev-Shwartz and Singer [2006]
to structured classification; it handles any loss in (3). In the case of the hinge loss, we recover the
online passive-aggressive algorithm (also known as MIRA, [Crammer et al., 2006]) as well as its
$k$-best variants. With the logistic loss, we obtain a new passive-aggressive algorithm for CRFs.

Start by noting that the learning problem in (2) is not affected if we multiply the objective by
$m$. Consider a sequence of primal objectives $P_1(\theta), \ldots, P_{m+1}(\theta)$ to be minimized, each of the
form

$$P_t(\theta) = \lambda m R(\theta) + \sum_{i=1}^{t-1} L(\theta; x_i, y_i).$$

Our goal is to minimize $P_{m+1}(\theta)$; for simplicity we consider online algorithms with only one pass
over the data, but the analysis can be extended to the case where multiple epochs are allowed.

Below, we let $\bar{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$ be the extended reals and, given a function $f : \mathbb{R}^n \to \bar{\mathbb{R}}$, we
denote by $f^* : \mathbb{R}^n \to \bar{\mathbb{R}}$ its convex conjugate, $f^*(y) = \sup_x x^\top y - f(x)$ (see Appendix A for
background on convex analysis). The next proposition, proved in [Kakade and Shalev-Shwartz,
2008], states a generalized form of Fenchel duality, which involves a dual vector $\mu_i \in \mathbb{R}^d$ for each
instance.

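Proposition 2 ([Kakade and Shalev-Shwartz, 2008]) The Lagrange dual of $\min_\theta P_t(\theta)$ is

$$\max_{\mu_1, \ldots, \mu_{t-1}} D_t(\mu_1, \ldots, \mu_{t-1}),$$

where

$$D_t(\mu_1, \ldots, \mu_{t-1}) = -\lambda m R^*\Big(-\frac{1}{\lambda m} \sum_{i=1}^{t-1} \mu_i\Big) - \sum_{i=1}^{t-1} L^*(\mu_i; x_i, y_i). \qquad (10)$$

Algorithm 1 Dual coordinate ascent (DCA)

Input: $\mathcal{D}$, $\lambda$, number of iterations $K$
Initialize $\theta_1 = 0$; set $m = |\mathcal{D}|$ and $T = mK$
for $t = 1$ to $T$ do
    Receive an instance $x_t, y_t$
    Update $\theta_{t+1}$ by solving (11) exactly or approximately (see Alg. 2)
end for
Return the averaged model $\bar{\theta} \leftarrow \frac{1}{T} \sum_{t=1}^{T} \theta_t$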


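Algorithm 2 Parameter updates

Input: current model $\theta_t$, instance $(x_t, y_t)$, $\lambda$
Obtain $z_t$ from $y_t$
Solve the variational problem in (8) to obtain $\bar{z}_t$ and $L_{\beta,\gamma}(\theta_t; x_t, y_t)$
Compute $\nabla L_{\beta,\gamma}(\theta_t; x_t, y_t) = F(x_t)(\bar{z}_t - z_t)$
Compute $\eta_t = \min\Big\{\frac{1}{\lambda m}, \frac{L(\theta_t; x_t, y_t)}{\|\nabla L(\theta_t; x_t, y_t)\|^2}\Big\}$
Return $\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t; x_t, y_t)$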


If $R(\theta) = \frac{1}{2}\|\theta\|^2$, then $R = R^*$, and strong duality holds for any convex $L$, i.e., $P_t(\theta^*) =
D_t(\mu_1^*, \ldots, \mu_{t-1}^*)$, where $\theta^*$ and $\mu_1^*, \ldots, \mu_{t-1}^*$ are respectively the primal and dual optima.
Moreover, the following primal-dual relation holds: $\theta^* = -\frac{1}{\lambda m} \sum_{i=1}^{t-1} \mu_i^*$.

We can therefore transform our problem into that of maximizing $D_{m+1}(\mu_1, \ldots, \mu_m)$. Dual
coordinate ascent (DCA) is an umbrella name for algorithms that manipulate a single dual
coordinate at a time. In our setting, the largest such improvement at round $t$ is achieved by $\mu_t =
\arg\max_\mu D_{t+1}(\mu_1, \ldots, \mu_{t-1}, \mu)$. The next proposition, proved in Appendix B, characterizes the
mapping of this subproblem back into the primal space, shedding light on the connections with
known online algorithms.

Proposition 3 Let $\theta_t = -\frac{1}{\lambda m} \sum_{i=1}^{t-1} \mu_i$. The Lagrange dual of $\max_\mu D_{t+1}(\mu_1, \ldots, \mu_{t-1}, \mu)$ is

$$\min_\theta \frac{\lambda m}{2} \|\theta - \theta_t\|^2 + L(\theta; x_t, y_t). \qquad (11)$$

Assembling these pieces together yields Alg. 1, where the solution of (11) is carried out by
Alg. 2, as explained next.⁵ While the problem in (11) is easier than the batch problem in (2), an
exact solution may still be prohibitively expensive in large-scale settings, particularly because it has
to be solved repeatedly. We thus adopt a simpler strategy that still guarantees some improvement
in the dual. Noting that $L$ is non-negative, we may rewrite (11) as

$$\min_{\theta, \xi} \frac{\lambda m}{2} \|\theta - \theta_t\|^2 + \xi \quad \text{s.t.} \quad L(\theta; x_t, y_t) \le \xi, \ \xi \ge 0. \qquad (12)$$

From the convexity of $L$, we may take its first-order Taylor approximation around $\theta_t$ to obtain the
lower bound $L(\theta; x_t, y_t) \ge L(\theta_t; x_t, y_t) + (\theta - \theta_t)^\top \nabla L(\theta_t; x_t, y_t)$. Therefore the true minimum
in (11) is lower bounded by

$$\min_{\theta, \xi} \frac{\lambda m}{2} \|\theta - \theta_t\|^2 + \xi \quad \text{s.t.} \quad L(\theta_t; x_t, y_t) + (\theta - \theta_t)^\top \nabla L(\theta_t; x_t, y_t) \le \xi, \ \xi \ge 0. \qquad (13)$$


⁵The final averaging step in Alg. 1 is an online-to-batch conversion with good generalization guarantees
[Cesa-Bianchi et al., 2004].

This is a Euclidean projection problem (with slack) that admits the closed form solution $\theta_{t+1} =
\theta_t - \eta_t \nabla L(\theta_t; x_t, y_t)$, with

$$\eta_t = \min\Big\{\frac{1}{\lambda m}, \frac{L(\theta_t; x_t, y_t)}{\|\nabla L(\theta_t; x_t, y_t)\|^2}\Big\}. \qquad (14)$$
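The closed-form update in (14) and the outer loop of Alg. 1 can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not the authors' implementation: it uses a toy multiclass problem with the logistic loss ($\beta = 1$, $\gamma = 0$), block features (the input vector copied into the block of the candidate label), and hypothetical function names `dca_step`, `crf_loss_and_grad`, and `train_dca`.

```python
import math

def dca_step(theta, loss_value, grad, lam, m):
    """One approximate dual coordinate ascent update with the step size of (14)."""
    sq_norm = sum(g * g for g in grad)
    if sq_norm == 0.0:
        return list(theta)  # zero gradient: the update leaves theta unchanged
    eta = min(1.0 / (lam * m), loss_value / sq_norm)
    return [t - eta * g for t, g in zip(theta, grad)]

def crf_loss_and_grad(theta, x, gold, n_labels):
    """Logistic loss and gradient for block features phi(x, y) = x placed in block y."""
    d = len(x)
    scores = [sum(theta[y * d + j] * x[j] for j in range(d)) for y in range(n_labels)]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    probs = [v / total for v in w]
    grad = [0.0] * (n_labels * d)
    for y in range(n_labels):
        coeff = probs[y] - (1.0 if y == gold else 0.0)  # expected minus observed indicator
        for j in range(d):
            grad[y * d + j] = coeff * x[j]
    return -math.log(probs[gold]), grad

def train_dca(data, n_labels, lam, epochs):
    """Alg. 1 with the approximate update; returns the average of theta_1, ..., theta_T."""
    m, d = len(data), len(data[0][0])
    theta = [0.0] * (n_labels * d)
    total = [0.0] * (n_labels * d)
    for _ in range(epochs):
        for x, y in data:
            total = [a + b for a, b in zip(total, theta)]  # accumulate theta_t before updating
            value, grad = crf_loss_and_grad(theta, x, y, n_labels)
            theta = dca_step(theta, value, grad, lam, m)
    return [a / (epochs * m) for a in total]

data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([2.0, 0.0], 0), ([0.0, 2.0], 1)]
model = train_dca(data, n_labels=2, lam=0.01, epochs=5)
```

Note that no learning rate appears anywhere: the only hyperparameter is `lam`, which caps the step size through $1/(\lambda m)$.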


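Example: 1-best MIRA. If $L$ is the hinge loss, we obtain from (9) $\nabla L_{\mathrm{SVM}}(\theta_t; x_t, y_t) = F(x_t)(\hat{z}_t -
z_t) = \phi(x_t, \hat{y}_t) - \phi(x_t, y_t)$, where $\hat{y}_t = \arg\max_{y'_t \in \mathcal{Y}(x_t)} \theta_t^\top(\phi(x_t, y'_t) - \phi(x_t, y_t)) + \ell(y'_t, y_t)$. The
update becomes $\theta_{t+1} = \theta_t - \eta_t(\phi(x_t, \hat{y}_t) - \phi(x_t, y_t))$, with

$$\eta_t = \min\Big\{\frac{1}{\lambda m}, \frac{\theta_t^\top(\phi(x_t, \hat{y}_t) - \phi(x_t, y_t)) + \ell(\hat{y}_t, y_t)}{\|\phi(x_t, \hat{y}_t) - \phi(x_t, y_t)\|^2}\Big\}. \qquad (15)$$

This is precisely the max-loss variant of the 1-best MIRA algorithm [Crammer et al., 2006, §8].
Hence, while MIRA was originally motivated by a conservativeness-correctness tradeoff, it turns
out that it also performs coordinate ascent in the dual.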
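A minimal sketch of the update in (15), assuming an output set small enough to enumerate so that cost-augmented decoding is a plain argmax. The function name `mira_step` and its list-based arguments are hypothetical; in a real structured problem the argmax would be computed by a decoder.

```python
def mira_step(theta, feats, costs, gold, lam, m):
    """1-best MIRA (max-loss variant) on an explicitly enumerated output set.

    feats[k] is phi(x, y_k); costs[k] is the cost of y_k, with costs[gold] == 0.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    # cost-augmented decoding: argmax of theta^T phi(x, y') + cost(y', y)
    k = max(range(len(feats)), key=lambda i: dot(theta, feats[i]) + costs[i])
    diff = [a - b for a, b in zip(feats[k], feats[gold])]
    hinge = dot(theta, diff) + costs[k]
    sq_norm = dot(diff, diff)
    if sq_norm == 0.0:
        return list(theta)  # the gold output already wins with margin: passive step
    eta = min(1.0 / (lam * m), hinge / sq_norm)
    return [t - eta * d for t, d in zip(theta, diff)]

feats = [[1.0, 0.0], [0.0, 1.0]]
costs = [0.0, 1.0]
aggressive = mira_step([0.0, 0.0], feats, costs, gold=0, lam=1.0, m=1)
capped = mira_step([0.0, 0.0], feats, costs, gold=0, lam=10.0, m=1)
```

In the first call the step is large enough to bring the hinge loss on this instance to zero; in the second the cap $1/(\lambda m)$ is active and the update is more conservative.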


Example: CRFs. This framework immediately allows us to extend 1-best MIRA for CRFs,
which optimizes the logistic loss. In that case, the exact problem in (12) can be expressed as

$$\min_{\theta, \xi} \frac{\lambda m}{2} \|\theta - \theta_t\|^2 + \xi \quad \text{s.t.} \quad -\log P_\theta(y_t|x_t) \le \xi, \ \xi \ge 0.$$

In words: stay as close as possible to the previous parameter vector, but correct the model so
that the conditional probability $P_\theta(y_t|x_t)$ becomes large enough. From (9), $\nabla L_{\mathrm{CRF}}(\theta_t; x_t, y_t) =
F(x_t)(\bar{z}_t - z_t) = E_{\theta_t}\phi(x_t, Y_t) - \phi(x_t, y_t)$, where now $\bar{z}_t$ is an expectation instead of a mode. The
update becomes $\theta_{t+1} = \theta_t - \eta_t(E_{\theta_t}\phi(x_t, Y_t) - \phi(x_t, y_t))$, with

$$\eta_t = \min\Big\{\frac{1}{\lambda m}, \frac{\theta_t^\top(E_{\theta_t}\phi(x_t, Y_t) - \phi(x_t, y_t)) + H(P_{\theta_t}(.|x_t))}{\|E_{\theta_t}\phi(x_t, Y_t) - \phi(x_t, y_t)\|^2}\Big\} = \min\Big\{\frac{1}{\lambda m}, \frac{-\log P_{\theta_t}(y_t|x_t)}{\|E_{\theta_t}\phi(x_t, Y_t) - \phi(x_t, y_t)\|^2}\Big\}. \qquad (16)$$

Thus, the difference with respect to standard 1-best MIRA (15) consists of replacing the feature
vector of the loss-augmented mode $\phi(x_t, \hat{y}_t)$ by the expected feature vector $E_{\theta_t}\phi(x_t, Y_t)$ and the
cost function $\ell(\hat{y}_t, y_t)$ by the entropy function $H(P_{\theta_t}(.|x_t))$.
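The equality between the two numerators in (16) follows from the variational representation (6) evaluated at its maximizer. A short derivation, written under the assumption that $\bar{z}_t$ denotes the marginals of $P_{\theta_t}(.|x_t)$ so that $F(x_t)\bar{z}_t = E_{\theta_t}\phi(x_t, Y_t)$:

```latex
\begin{align*}
-\log P_{\theta_t}(y_t \mid x_t)
  &= \log Z(\theta_t, x_t) - \theta_t^\top \phi(x_t, y_t) \\
  &= \theta_t^\top F(x_t)\bar{z}_t + H(\bar{z}_t) - \theta_t^\top \phi(x_t, y_t) \\
  &= \theta_t^\top \bigl( E_{\theta_t}\phi(x_t, Y_t) - \phi(x_t, y_t) \bigr) + H\bigl(P_{\theta_t}(\cdot \mid x_t)\bigr).
\end{align*}
```

The first line uses the log-linear form (5), the second uses (6) at the maximizer, and the third substitutes the expected feature vector and the entropy of the coupled distribution.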


Example: $k$-best MIRA. Tighter approximations to the problem in (11) can be built by using the
variational representation machinery; see (8) for losses in the family $L_{\beta,\gamma}$. Plugging this variational
representation into the constraint in (12) we obtain the following semi-infinite quadratic program:

$$\min_{\theta, \xi} \frac{\lambda m}{2} \|\theta - \theta_t\|^2 + \xi \quad \text{s.t.} \quad \theta \in \mathcal{H}(z'_t; z_t, \beta, \gamma) \ \ \forall z'_t \in Z(x_t), \quad \xi \ge 0, \qquad (17)$$

where $\mathcal{H}(z'; z, \beta, \gamma) = \{\theta \mid a^\top \theta \le b\}$ is a half-space with $a = F(x)(z' - z)$ and $b = \xi - \gamma(p^\top z' +
q) - \beta^{-1} H(z')$. The constraint set in (17) is a convex set defined by the intersection of uncountably
many half-spaces (indexed by the points in the marginal polytope).⁶ Our approximation consisted
of relaxing the problem in (17) by discarding all half-spaces except the one indexed by $\bar{z}_t$, the dual
parameter of the current iterate $\theta_t$; however, tighter relaxations are obtained by keeping some of the
other half-spaces. For the hinge loss, rather than just using the mode $\hat{z}_t$, one may rank the $k$-best
outputs and add a half-space constraint for each. This procedure approximates the constraint set
by a polyhedron and the resulting problem can be addressed using row-action methods, such as
Hildreth's algorithm [Censor and Zenios, 1997]. This corresponds precisely to $k$-best MIRA.⁷


5  Experiments 

We report experiments on two tasks: named entity recognition and dependency parsing. For each,
we compare DCA (Alg. 1) with SGD. We report results for several values for the regularization
parameter $C = 1/(\lambda m)$. To choose the learning rate for SGD, we use the formula $\eta_t = \eta/(1 +
(t - 1)/m)$ [LeCun et al., 1998]. We choose $\eta$ using dev-set validation after a single epoch [Collins
et al., 2008].
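For reference, the SGD learning rate schedule quoted above can be written as a one-line function. The name `sgd_learning_rate` and the argument `eta0` (standing for the initial rate $\eta$) are labels chosen for this sketch; rounds are counted from $t = 1$.

```python
def sgd_learning_rate(eta0, t, m):
    """eta_t = eta0 / (1 + (t - 1) / m), for 1-based round t and m training instances."""
    return eta0 / (1.0 + (t - 1) / m)

# The rate starts at eta0 and is halved after one full pass over m = 100 instances.
first = sgd_learning_rate(0.1, 1, 100)
after_one_epoch = sgd_learning_rate(0.1, 101, 100)
```

The initial rate `eta0` is exactly the quantity that has to be tuned on development data for SGD and that DCA does not need.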

Named Entity Recognition. We use the English data from the CoNLL 2003 shared task [Tjong
Kim Sang and De Meulder, 2003], which consist of English news articles annotated with four
entity types: person, location, organization, and miscellaneous. We used a standard set of feature
templates, as in [Kazama and Torisawa, 2007], with token shape features [Collins, 2002b] and
simple gazetteer features; a feature was included iff it occurs at least once in the training set (total
1,312,255 features). The task is evaluated using the $F_1$ measure computed at the granularity of
entire entities. We set $\beta = 1$ and $\gamma = 0$ (the CRF case). In addition to SGD, we also compare with
L-BFGS [Liu and Nocedal, 1989], a common choice for optimizing conditional log-likelihood.
We used $\{10^a, a = -3, \ldots, 2\}$ for the set of values considered for $\eta$ in SGD. Fig. 3 shows that
DCA (which only requires tuning one hyperparameter) reaches better-performing models than the
baselines.

Dependency Parsing. We trained non-projective dependency parsers for three languages
(Arabic, Danish, and English), using datasets from the CoNLL-X and CoNLL-2008 shared tasks
[Buchholz and Marsi, 2006, Surdeanu et al., 2008]. Performance is assessed by the unlabeled attachment
score (UAS), the fraction of non-punctuation words which were assigned the correct parent. We
adapted TurboParser⁸ to handle any loss function $L_{\beta,\gamma}$ via Alg. 1; for decoding, we used the loopy
BP algorithm of Smith and Eisner [2008] (see §3.2). We used the pruning strategy in [Martins
et al., 2009] and tried two feature configurations: an arc-factored model, for which decoding is
exact, and a model with second-order features (siblings and grandparents) for which it is
approximate. The comparison with SGD for the CRF case is shown in Fig. 4. For the arc-factored models,

⁶Interestingly, when the hinge loss is used, only a finite (albeit exponentially large) number of these half-spaces are
necessary, those indexed by vertices of the marginal polytope. In this case, the constraint set is polyhedral.

⁷The prediction-based variant of 1-best MIRA [Crammer et al., 2006] is also a particular case, where $\hat{z}_t$ is the
prediction under the current model $\theta_t$, rather than the mode of $L_{\mathrm{SVM}}(\theta_t; x_t, y_t)$.

⁸Available at http://www.ark.cs.cmu.edu/TurboParser.

[Figure 3 legend: DCA, C = 10; DCA, C = 1; DCA, C = 0.1; SGD, η = 1, C = 10; SGD, η = 0.1, C = 1; SGD, η = 0.1, C = 0.1; L-BFGS, C = 1]

Figure 3: Named entity recognition. Learning curves for DCA (Alg. 1), SGD, and L-BFGS. The
SGD curve for $C = 10$ is lower than the others because dev-set validation chose a suboptimal value
of $\eta$. DCA, by contrast, does not require choosing any hyperparameters other than $C$. L-BFGS
ultimately converges after 121 iterations to an $F_1$ of 90.53 on the development data and 85.31 on
the test data.


                                β     1         1       1       1       3       5       ∞
                                γ     0 (CRF)   1       3       5       1       1       1 (SVM)
NER                  Best C           1.0       10.0    1.0     1.0     1.0     1.0     1.0
                     F1 (%)           85.48     85.54   85.65   85.72   85.55   85.48   85.41
Dependency parsing   Best C           0.1       0.01    0.01    0.01    0.01    0.01    0.1
                     UAS (%)          90.76     90.95   91.04   91.01   90.94   90.91   90.75

Table 1: Varying $\beta$ and $\gamma$: neither the CRF nor the SVM is optimal. We report only the results for
the best $C$, chosen from {0.001, 0.01, 0.1, 1} with dev-set validation. For named entity recognition,
we show test set $F_1$ after $K = 50$ iterations (empty cells will be filled in in the final version).
Dependency parsing experiments used the arc-factored model on English and $K = 10$.

the learning curve of DCA seems to lead faster to an accurate model. Notice that the plots do not
account for the fact that SGD requires four extra iterations to choose the learning rate. For the
second-order models of Danish and English, however, DCA did not perform as well.⁹

Finally, Table 1 shows results obtained for different settings of $\beta$ and $\gamma$.¹⁰ Interestingly, we
observe that the higher scores are obtained for loss functions that are "between" SVMs and CRFs.


6  Conclusion 

We presented a general framework for aggressive online learning of structured classifiers by
optimizing any loss function in a wide family. The technique does not require a learning rate to be
specified. We derived an efficient technique for evaluating the loss function and its gradient.
Experiments in named entity recognition and dependency parsing showed that the algorithm converges
to accurate models at least as fast as stochastic gradient descent.

⁹Further analysis showed that for ~15% of the training instances, loopy BP led to very poor variational
approximations of $\log Z(\theta, x)$, yielding estimates $P_{\theta_t}(y_t|x_t) > 1$, thus a negative learning rate (see (16)), that we truncate
to zero. Thus, no update occurs for those instances, explaining the slower convergence. A possible way to fix this
problem is to use techniques that guarantee upper bounds on the log-partition function [Wainwright and Jordan, 2008].

¹⁰Observe that there are only two degrees of freedom: indeed, $(\lambda, \beta, \gamma)$ and $(\lambda', \beta', \gamma')$ lead to equivalent learning
problems if $\lambda' = \lambda/a$, $\beta' = \beta/a$ and $\gamma' = a\gamma$ for any $a > 0$, with the solutions related via $\theta' = a\theta$.

Figure 4: Learning curves for DCA (Alg. 1) and SGD, the latter with the learning rate $\eta = 0.01$
chosen from {0.001, 0.01, 0.1, 1} using the same procedure as before. The instability when training
the second-order models might be due to the fact that inference there is approximate.
A  Background  on  Convex  Analysis


We briefly review some notions of convex analysis that are used throughout the paper. For more
details, see, e.g., Boyd and Vandenberghe [2004]. Below, $\Delta^d = \{\mu \in \mathbb{R}^d \mid \sum_{j=1}^{d} \mu_j = 1,\ \mu_j \ge 0\ \forall j\}$ is the probability simplex in $\mathbb{R}^d$ and $B_\gamma(x) = \{y \in \mathbb{R}^d \mid \|y - x\| \le \gamma\}$ is the ball with
radius $\gamma$ centered at $x$.

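A set $\mathcal{C} \subseteq \mathbb{R}^d$ is convex if $\mu x + (1 - \mu) y \in \mathcal{C}$ for all $x, y \in \mathcal{C}$ and $\mu \in [0, 1]$. The convex hull
of a set $\mathcal{X} \subseteq \mathbb{R}^d$ is the set of all convex combinations of the elements of $\mathcal{X}$,

$$\operatorname{conv} \mathcal{X} = \Big\{ \sum_{i=1}^{p} \mu_i x_i \;\Big|\; \mu \in \Delta^p,\ x_1, \ldots, x_p \in \mathcal{X},\ p \in \mathbb{N} \Big\};$$

it is also the smallest convex set that contains $\mathcal{X}$. The affine hull of $\mathcal{X} \subseteq \mathbb{R}^d$ is the set of all affine
combinations of the elements of $\mathcal{X}$,

$$\operatorname{aff} \mathcal{X} = \Big\{ \sum_{i=1}^{p} \mu_i x_i \;\Big|\; \sum_{i=1}^{p} \mu_i = 1,\ x_1, \ldots, x_p \in \mathcal{X},\ p \in \mathbb{N} \Big\};$$

it is also the smallest affine set that contains $\mathcal{X}$. The relative interior of $\mathcal{X}$ is its interior relative to
the affine hull $\operatorname{aff} \mathcal{X}$,

$$\operatorname{relint} \mathcal{X} = \{x \in \mathcal{X} \mid \exists \gamma > 0 : B_\gamma(x) \cap \operatorname{aff} \mathcal{X} \subseteq \mathcal{X}\}.$$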

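Let $\bar{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$ be the extended reals. The effective domain of a function $f : \mathbb{R}^d \to \bar{\mathbb{R}}$ is
the set $\operatorname{dom} f = \{x \in \mathbb{R}^d \mid f(x) < +\infty\}$. $f$ is proper if $\operatorname{dom} f \neq \emptyset$. The epigraph of $f$ is the set
$\operatorname{epi} f = \{(x, t) \in \mathbb{R}^d \times \mathbb{R} \mid f(x) \le t\}$. $f$ is lower semicontinuous (lsc) if the epigraph is closed in
$\mathbb{R}^d \times \mathbb{R}$. $f$ is convex if $\operatorname{dom} f$ is a convex set and

$$f(\mu x + (1 - \mu) y) \le \mu f(x) + (1 - \mu) f(y), \quad \forall x, y \in \operatorname{dom} f,\ \mu \in [0, 1].$$

The (Fenchel) conjugate of $f$ is the function $f^* : \mathbb{R}^d \to \bar{\mathbb{R}}$ defined as

$$f^*(y) = \sup_{x \in \mathbb{R}^d} \; x^\top y - f(x).$$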

$f^*$ is always convex, since it is the supremum of a family of affine functions. Some examples
and properties follow:

•  If $f$ is an affine function, $f(x) = a^\top x + b$, then $f^*(y) = -b$ if $y = a$ and $+\infty$ otherwise.

•  If $f$ is the $\ell_p$-norm, $f(x) = \|x\|_p$, then $f^*$ is the indicator of the unit ball induced by the dual
norm, $f^*(y) = 0$ if $\|y\|_q \le 1$ and $+\infty$ otherwise, with $p^{-1} + q^{-1} = 1$.

•  If $f$ is half of the squared $\ell_p$-norm, $f(x) = \|x\|_p^2/2$, then $f^*$ is half of the squared dual norm,
$f^*(y) = \|y\|_q^2/2$, with $p^{-1} + q^{-1} = 1$.

•  If $f$ is convex, lsc, and proper, then $f^{**} = f$.

•  If $g(x) = t f(x - x_0)$, with $t \in \mathbb{R}_{+}$ and $x_0 \in \mathbb{R}^d$, then $g^*(y) = x_0^\top y + t f^*(y/t)$.
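
Some of the conjugate pairs listed above can be checked numerically in one dimension. The sketch below approximates the supremum in the definition of $f^*$ by brute force over a bounded grid; the function names, grid bounds, and test values are choices made for this illustration only, and a bounded grid can only hint at a conjugate value of $+\infty$ (the approximation keeps growing as the grid widens).

```python
def conjugate_1d(f, y, lo=-50.0, hi=50.0, n=200001):
    """Approximate f*(y) = sup_x (x*y - f(x)) by brute force over a uniform grid on [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return max((lo + k * step) * y - f(lo + k * step) for k in range(n))

def half_square(x):
    """Half of the squared norm in one dimension; it is its own conjugate."""
    return 0.5 * x * x

def affine(x, a=2.0, b=1.0):
    """Affine function a*x + b; its conjugate is -b at y = a and +infinity elsewhere."""
    return a * x + b

def scaled_shifted(x, t=2.0, x0=1.5):
    """g(x) = t * f(x - x0) with f = half_square."""
    return t * half_square(x - x0)

if __name__ == "__main__":
    y = 3.0
    # Self-conjugacy of the half squared norm.
    print(conjugate_1d(half_square, y), half_square(y))
    # Affine case: finite at y = a, unbounded in the grid radius otherwise.
    print(conjugate_1d(affine, 2.0))
    print(conjugate_1d(affine, y, lo=-10.0, hi=10.0), conjugate_1d(affine, y, lo=-100.0, hi=100.0))
    # Scaling and translation rule: g*(y) = x0*y + t*f*(y/t).
    print(conjugate_1d(scaled_shifted, y), 1.5 * y + 2.0 * half_square(y / 2.0))
```

The scaling and translation rule is the one used in the proof of Proposition 3, so checking it on a quadratic is a cheap way to confirm the signs and the placement of the factor $t$.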
B  Proof  of  Proposition  3

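From (10),

$$
\begin{aligned}
&\max_{\mu_t} \; -\frac{1}{2\lambda m} \Big\| \sum_{i=1}^{m} \mu_i \Big\|^2 - \sum_{i=1}^{m} L^*(\mu_i; x_i, y_i) \\
&\quad = \max_{\mu_t} \; -\frac{1}{2\lambda m} \Big\| \sum_{i \neq t} \mu_i + \mu_t \Big\|^2 - L^*(\mu_t; x_t, y_t) + \text{constant} \\
&\quad = \max_{\mu_t} \; -\frac{1}{2\lambda m} \big\| -\lambda m \theta_t + \mu_t \big\|^2 - L^*(\mu_t; x_t, y_t) + \text{constant} \\
&\quad \overset{(i)}{=} \max_{\mu_t} \; -\frac{1}{2\lambda m} \big\| -\lambda m \theta_t + \mu_t \big\|^2 - \max_{\theta} \big( \mu_t^\top \theta - L(\theta; x_t, y_t) \big) + \text{constant} \\
&\quad \overset{(ii)}{=} \min_{\theta} \max_{\mu_t} \; -\frac{1}{2\lambda m} \big\| -\lambda m \theta_t + \mu_t \big\|^2 - \mu_t^\top \theta + L(\theta; x_t, y_t) + \text{constant} \\
&\quad = \min_{\theta} \Big( \max_{\mu_t} \; \mu_t^\top (-\theta) - \frac{1}{2\lambda m} \big\| \mu_t - \lambda m \theta_t \big\|^2 \Big) + L(\theta; x_t, y_t) + \text{constant} \\
&\quad \overset{(iii)}{=} \min_{\theta} \; \frac{\lambda m}{2} \| \theta - \theta_t \|^2 + L(\theta; x_t, y_t) + \text{constant},
\end{aligned}
\tag{18}
$$
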


where in (i) we invoked the definition of convex conjugate; in (ii) we interchange min and max
since strong duality holds (as stated in [Kakade and Shalev-Shwartz, 2008], a sufficient condition
is that $R$ is strongly convex, $L$ is convex and $\operatorname{dom} L$ is polyhedral); and in (iii) we used the facts that
$R(\theta) = \|\theta\|^2/2$ is conjugate of itself, and that $g(x) = t f(x - x_0)$ implies $g^*(y) = x_0^\top y + t f^*(y/t)$.
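
The last step of the derivation can be sanity-checked numerically. Setting the gradient of the inner objective $-\mu_t^\top \theta - \frac{1}{2\lambda m}\|\mu_t - \lambda m \theta_t\|^2$ to zero gives the maximizer $\mu_t = \lambda m(\theta_t - \theta)$, at which the objective equals $\frac{\lambda m}{2}\|\theta - \theta_t\|^2 - \frac{\lambda m}{2}\|\theta_t\|^2$; the second term does not depend on $\theta$ and is absorbed into the constant. The sketch below verifies this on random vectors; the function names, dimensions, and parameter values are hypothetical choices made for this illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inner_objective(mu, theta, theta_t, lam, m):
    """-mu^T theta - ||mu - lam*m*theta_t||^2 / (2*lam*m): the term maximized over mu."""
    shifted = [a - lam * m * b for a, b in zip(mu, theta_t)]
    return -dot(mu, theta) - dot(shifted, shifted) / (2.0 * lam * m)

def maximizer(theta, theta_t, lam, m):
    """Stationary point of the concave inner objective: mu = lam*m*(theta_t - theta)."""
    return [lam * m * (b - a) for a, b in zip(theta, theta_t)]

def proximal_value(theta, theta_t, lam, m):
    """(lam*m/2) * ||theta - theta_t||^2 minus the theta-independent term (lam*m/2) * ||theta_t||^2."""
    diff = [a - b for a, b in zip(theta, theta_t)]
    return 0.5 * lam * m * (dot(diff, diff) - dot(theta_t, theta_t))

def check(dim=4, lam=0.1, m=20, trials=200, seed=0):
    """Return (gap between the two expressions, whether any random perturbation beats the maximizer)."""
    rng = random.Random(seed)
    theta = [rng.uniform(-1, 1) for _ in range(dim)]
    theta_t = [rng.uniform(-1, 1) for _ in range(dim)]
    mu_star = maximizer(theta, theta_t, lam, m)
    best = inner_objective(mu_star, theta, theta_t, lam, m)
    gap = abs(best - proximal_value(theta, theta_t, lam, m))
    # The objective is concave in mu, so no perturbation should improve on the stationary point.
    improved = any(
        inner_objective([a + rng.gauss(0, 1) for a in mu_star], theta, theta_t, lam, m) > best + 1e-12
        for _ in range(trials)
    )
    return gap, improved

if __name__ == "__main__":
    print(check())
```

This only checks the algebra of the inner maximization; it says nothing about the interchange of min and max, which relies on the strong duality conditions cited in the proof.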